Conformal Linguistic Calibration: Trading-off between Factuality and Specificity

Neural Information Processing Systems

Language model outputs are not always reliable, thus prompting research into how to adapt model responses based on uncertainty. Common approaches include: abstention, where models refrain from generating responses when uncertain; and linguistic calibration, where models hedge their statements using uncertainty quantifiers. However, abstention can withhold valuable information, while linguistically calibrated responses are often challenging to leverage in downstream tasks. We propose a unified view, Conformal Linguistic Calibration (CLC), which reinterprets linguistic calibration as answer set prediction. First we present a framework connecting abstention and linguistic calibration through the lens of linguistic pragmatics. We then describe an implementation of CLC that allows for controlling the level of imprecision in model responses. Results demonstrate our method produces calibrated outputs with conformal guarantees on factual accuracy. Further, our approach enables fine-tuning models to perform uncertainty-aware adaptive claim rewriting, offering a controllable balance between factuality and specificity.
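The conformal guarantee mentioned above can be illustrated with a generic split-conformal calibration step. This is a minimal sketch, not the authors' implementation: it assumes each calibration claim carries a hypothetical numeric score giving the smallest imprecision level at which its rewrite becomes factually correct, and it picks the level that covers roughly a 1 - alpha fraction of claims. The function name and the score definition are assumptions made for illustration.

```python
import math


def conformal_threshold(scores, alpha):
    """Return the split-conformal quantile of the calibration scores.

    scores: hypothetical per-claim values, each the smallest imprecision
            level at which that claim's rewrite is factually correct.
    alpha:  target error rate (e.g. 0.2 for roughly 80% coverage).
    """
    n = len(scores)
    if n == 0:
        raise ValueError("need at least one calibration score")
    # Finite-sample correction: take the ceil((n + 1) * (1 - alpha))-th
    # smallest score instead of the plain empirical quantile.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points to certify this alpha at any level.
        return float("inf")
    return sorted(scores)[k - 1]


calibration_scores = [0, 1, 1, 2, 3, 0, 1, 2, 4, 1]
level = conformal_threshold(calibration_scores, alpha=0.2)
```

Under the usual exchangeability assumption, rewriting a new claim to the returned level keeps it factually correct with probability of at least 1 - alpha. A smaller alpha pushes the level up, which trades specificity for factuality.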


Supplementary Materials AGMMU: A Comprehensive Agricultural Multimodal Understanding Benchmark
Aruna Gauba, Irene Pi, Yunze Man, Ziqi Pang, Vikram S. Adve, Yu-Xiong Wang

Neural Information Processing Systems

Our evaluation and analysis are conducted mainly on the group of models listed in Table 2 in the main paper. We have chosen models such that they cover most of the popular and best-performing methods used by recent multimodal understanding work. In this part, we discuss all the models we have used in our experiments, explain their evaluation details and the public checkpoints we have chosen, and display the prompts we used to adapt the models to our datasets.
During evaluation, we chose to follow the standard prompt provided by the authors whenever possible for multiple-choice and short-answer questions. When no prompt is provided for a model, we use a custom prompt, chosen through several iterations of prompt engineering as the one that produces the most effective results. The images are always included as the prefix.
We used three proprietary models in our evaluation: GPT-o4-mini [1], Gemini 1.5 Pro [9], and Claude 3 Haiku [10]. Below we note the model API version used for evaluation. GPT-o4-mini: May 13-15, 2025.
Cambrian-1 is a recent state-of-the-art model that excels at visual-centric tasks. This model explores combinations of vision encoders, text and image integration techniques, and instruction tuning strategies. We use the official implementation and checkpoint with a LLaMA3-8B-Instruct LLM backbone model in our evaluation.
InternVL scales up the vision foundation model while aligning it with the backbone LLM, and is trained on web-scale image-text data to achieve strong performance across a variety of vision-centric tasks. We use the official implementation and checkpoint with the InternViT-300M-448px vision backbone and Internlm2.5-7B-chat
LLaMA-3.2 is the first collection of multimodal large language models from the LLaMA family, which was previously text-only. The integration of vision involves utilizing cross-attention layers and a pre-trained vision encoder that feeds directly into the text-processor. The model follows a commonly used training recipe that includes pretraining on noisy image-text pairs and then on high-quality knowledge-enhanced pairs. Notably, the language-model parameters were frozen during image-text alignment training to retain strong text-only capabilities. We use the official implementation and checkpoint that uses a LLaMA-3.1 text-only language backbone in our evaluation. When evaluating the model, we choose to use a custom prompt since no standard prompt is provided.


AGMMU: A Comprehensive Agricultural Multimodal Understanding Benchmark

Neural Information Processing Systems

Unlike prior datasets that rely on crowdsourced prompts, AGMMU is distilled from 116,231 authentic dialogues between everyday growers and USDA-authorized Cooperative Extension experts. Through a three-stage pipeline of automated knowledge extraction, QA generation, and human verification, we construct (i) AGMMU, an evaluation set of 746 multiple-choice questions (MCQs) and 746 open-ended questions (OEQs), and (ii) AGBASE, a development corpus of 57,079 multimodal facts covering five high-stakes agricultural topics: insect identification, species identification, disease categorization, symptom description, and management instruction. AGMMU has three key advantages. Authentic & Expert-Verified: All facts, images, and answers originate from real farmer and gardener inquiries answered by credentialed specialists, ensuring high-fidelity agricultural knowledge. Complete Development Suite: AGMMU uniquely couples a dual-format evaluation benchmark (MCQ and OEQ) with AGBASE, a large-scale training set, enabling both rigorous assessment and targeted improvement of VLMs. Knowledge-intensive Challenge: Our tasks demand the synergy of nuanced visual perception and domain expertise, exposing fundamental limitations of current general-purpose models and charting a path toward robust, application-ready agricultural AI. Benchmarking 12 leading VLMs reveals pronounced gaps in fine-grained perception and factual grounding. Open-sourced models trail proprietary ones by a wide margin. Simple fine-tuning on AGBASE boosts open-sourced model performance on challenging OEQs by up to 11.6% on average, narrowing this gap and motivating future research into better strategies for knowledge extraction and distillation from AGBASE. We hope AGMMU stimulates research on domain-specific knowledge integration and trustworthy decision support in agricultural AI development.


6c7c9811d06b41b320b69abf37234f84-Paper-Datasets_and_Benchmarks_Track.pdf

Neural Information Processing Systems

To quantify this stagnation, we introduce LIVEVQA, a first-of-its-kind dataset featuring 107,143 samples across 12 categories, specifically designed to support research in both seeking and updating live visual knowledge. Drawing from recent news articles, video platforms, and academic publications from April 2024 to May 2025, LIVEVQA enables evaluation of how models handle the latest visual information beyond their knowledge boundaries and how current methods help to update them. Our comprehensive benchmarking of 17 state-of-the-art MLLMs reveals significant performance gaps on content beyond the knowledge cutoff, while tool-use or agentic visual seeking frameworks yield an average improvement of 327%. Furthermore, we explore parameter-efficient fine-tuning (PEFT) methods to update MLLMs with new visual knowledge. We examine in depth the critical balance between adapter capacity and model capability when updating MLLMs with new visual knowledge. All experimental datasets and source code are publicly available at: https://livevqa.github.io.


Incoherent Beliefs & Inconsistent Actions in Large Language Models

arXiv.org Artificial Intelligence

Real-world tasks and environments differ from the static datasets that large language models (LLMs) are typically evaluated on. Such tasks can involve sequential interaction, requiring coherent updating of beliefs in light of new evidence, and making appropriate decisions based on those beliefs. Predicting how LLMs will perform in such dynamic environments is important, but is hard to do from measurements in static settings. In this work, we examine two critical components of LLM performance: the ability of LLMs to coherently update their beliefs, and the extent to which the actions they take are consistent with those beliefs. First, we find that LLMs are largely inconsistent in how they update their beliefs; models can exhibit up to a 30% average difference between the directly elicited posterior and the correct update of their prior. Second, we find that LLMs also often take actions which are inconsistent with the beliefs they hold. On a betting market, for example, LLMs often do not even bet in the same direction as their internally held beliefs over the underlying outcomes. We also find they have moderate self-inconsistency in how they respond to challenges by users to given answers. Finally, we show that the above properties hold even for strong models that obtain high accuracy or that are well-calibrated on the tasks at hand. Our results highlight the difficulties of predicting LLM behavior in complex real-world settings.
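The gap between a directly elicited posterior and the correct update of a prior can be made concrete for a binary hypothesis. The sketch below is one plausible reading, not the paper's protocol: it assumes the model has reported a prior, two likelihoods for the new evidence, and a posterior, and it measures how far the reported posterior sits from the Bayes-rule update. All function and parameter names are hypothetical.

```python
def bayes_posterior(prior, lik_if_true, lik_if_false):
    """Posterior P(H | E) for a binary hypothesis H via Bayes' rule."""
    numerator = prior * lik_if_true
    evidence = numerator + (1.0 - prior) * lik_if_false
    if evidence == 0.0:
        raise ValueError("evidence has zero probability under both hypotheses")
    return numerator / evidence


def coherence_gap(prior, lik_if_true, lik_if_false, elicited_posterior):
    """Absolute difference between the elicited and the Bayes-correct posterior."""
    correct = bayes_posterior(prior, lik_if_true, lik_if_false)
    return abs(elicited_posterior - correct)


# Illustrative numbers only: prior 0.5, evidence four times likelier under H.
# The correct update is 0.8; a model that stays at 0.5 is off by about 0.3.
gap = coherence_gap(0.5, 0.8, 0.2, elicited_posterior=0.5)
```

Averaging this gap over many elicitation episodes gives a single incoherence figure comparable in spirit to the average difference reported above.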


AA-Omniscience: Evaluating Cross-Domain Knowledge Reliability in Large Language Models

arXiv.org Artificial Intelligence

We introduce AA-Omniscience, a benchmark designed to measure both factual recall and knowledge calibration across 6,000 questions. Questions are derived from authoritative academic and industry sources, and cover 42 economically relevant topics within six different domains. The evaluation measures a model's Omniscience Index, a bounded metric (-100 to 100) measuring factual recall that jointly penalizes hallucinations and rewards abstention when uncertain, with 0 equating to a model that answers questions correctly as often as it does incorrectly. Among evaluated models, Claude 4.1 Opus attains the highest score (4.8), making it one of only three models to score above zero. These results reveal persistent factuality and calibration weaknesses across frontier models. Performance also varies by domain, with models from three different research labs leading across the six domains. This performance variability suggests that, for tasks where knowledge is important, models should be chosen according to the demands of the use case rather than general performance.
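The abstract does not give the formula for the Omniscience Index, but its stated properties (bounded between -100 and 100, zero when correct and incorrect answers balance, abstentions not penalized) are satisfied by a simple net-correct rate. The sketch below assumes that form; the function name and the exact formula are assumptions rather than the benchmark's published definition.

```python
def omniscience_index(correct, incorrect, abstained):
    """Assumed form: 100 * (correct - incorrect) / total questions.

    Abstentions count toward the total but add no penalty, so declining
    to answer scores better than hallucinating.
    """
    total = correct + incorrect + abstained
    if total == 0:
        raise ValueError("no questions were scored")
    return 100.0 * (correct - incorrect) / total


# Illustrative counts only, not results from the benchmark.
score = omniscience_index(correct=40, incorrect=30, abstained=30)
```

Under this form, turning one wrong answer into an abstention always raises the score, which matches the description of rewarding abstention when uncertain.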


Multi-Agent Tool-Integrated Policy Optimization

arXiv.org Artificial Intelligence

Large language models (LLMs) increasingly rely on multi-turn tool-integrated planning for knowledge-intensive and complex reasoning tasks. Existing implementations typically rely on a single agent, but they suffer from limited context length and noisy tool responses. A natural solution is to adopt a multi-agent framework with planner- and worker-agents to manage context. However, no existing methods support effective reinforcement learning post-training of tool-integrated multi-agent frameworks. To address this gap, we propose Multi-Agent Tool-Integrated Policy Optimization (MATPO), which enables distinct roles (planner and worker) to be trained within a single LLM instance using role-specific prompts via reinforcement learning. MATPO is derived from a principled credit assignment mechanism across planner and worker rollouts. This design eliminates the need to deploy multiple LLMs, which would be memory-intensive, while preserving the benefits of specialization. Experiments on GAIA-text, WebWalkerQA, and FRAMES show that MATPO consistently outperforms single-agent baselines, with an average relative improvement of 18.38%, and exhibits greater robustness to noisy tool outputs. Our findings highlight the effectiveness of unifying multiple agent roles within a single LLM and provide practical insights for stable and efficient multi-agent RL training.


Can Large Language Models Express Uncertainty Like Human?

arXiv.org Artificial Intelligence

Large language models (LLMs) are increasingly used in high-stakes settings, where overconfident responses can mislead users. Reliable confidence estimation has been shown to enhance trust and task accuracy. Yet existing methods face practical barriers: logits are often hidden, multi-sampling is computationally expensive, and verbalized numerical uncertainty (e.g., giving a 0-100 score) deviates from natural communication. We revisit linguistic confidence (LC), where models express uncertainty through hedging language (e.g., probably, might), offering a lightweight and human-centered alternative. To advance this direction, we 1) release the first diverse, large-scale dataset of hedging expressions with human-annotated confidence scores, and 2) propose a lightweight mapper that converts hedges into confidence scores at near-zero cost. Building on these resources, we 3) conduct the first systematic study of LC across modern LLMs and QA benchmarks, revealing that while most LLMs underperform in expressing reliable LC, carefully designed prompting achieves competitive calibration and discriminability. Finally, we 4) introduce a fine-tuning framework that further improves LC reliability. Taken together, our work positions linguistic confidence as a scalable, efficient, and human-aligned approach to LLM uncertainty estimation, and calls for deeper exploration of this promising yet underexplored direction. The code and dataset are anonymously available at https://anonymous.
Large language models (LLMs) are increasingly deployed in real-world applications, from education and healthcare to law and scientific discovery. While their capabilities make them powerful assistants, LLMs are also prone to hallucinations and factual errors, and human overreliance on their outputs can lead to serious consequences. For instance, a U.S. lawyer once submitted fabricated cases generated by ChatGPT, resulting in professional sanctions (ABC News, 2023).
Recent social experiments demonstrate that people adjust their reliance on AI depending on how confident the model appears: reliable expressions of uncertainty can enhance trust, satisfaction, and task accuracy (Kim et al., 2024; Xu et al., 2025). These findings highlight the importance of associating reliable uncertainty estimates with LLM responses to support human decision-making. Ultimately, the conveyance of confidence plays a central role in shaping trust and guiding human-AI interaction. A growing body of work explores the extraction and representation of confidence in LLM outputs. These methods are simple and inexpensive but require access to model logits, which are typically unavailable in commercial LLM APIs. However, such scores rarely align with common user behavior or natural communication, as users do not typically phrase queries with explicit instructions like "Please output your confidence along with the answer."
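The idea of converting hedging language into a confidence score can be sketched with a plain lookup table. This is only a stand-in for the lightweight mapper described in the abstract, whose actual design is not given here; the hedge list, the numeric values, and the default score are all hypothetical placeholders, not figures from the released dataset.

```python
import re

# Hypothetical hedge-to-confidence values, chosen only for illustration.
HEDGE_CONFIDENCE = {
    "definitely": 0.95,
    "probably": 0.75,
    "likely": 0.70,
    "might": 0.40,
    "possibly": 0.30,
    "unlikely": 0.15,
}


def linguistic_confidence(response, default=0.9):
    """Map a response to a confidence score from the hedges it contains.

    Returns the lowest score among the hedges found (the most cautious
    reading), or `default` when the response contains no known hedge.
    """
    words = re.findall(r"[a-z]+", response.lower())
    found = [HEDGE_CONFIDENCE[w] for w in words if w in HEDGE_CONFIDENCE]
    return min(found) if found else default


confidence = linguistic_confidence("The capital is probably Canberra.")
```

A table like this costs almost nothing at inference time and needs neither logits nor repeated sampling, which is the practical appeal of linguistic confidence described above. It cannot handle negation or multi-word hedges, so a learned mapper would be needed for those.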


Leveraging What's Overfixed: Post-Correction via LLM Grammatical Error Overcorrection

arXiv.org Artificial Intelligence

Robust supervised fine-tuned small Language Models (sLMs) often show high reliability but tend to undercorrect: they achieve high precision at the cost of low recall. Conversely, Large Language Models (LLMs) often show the opposite tendency, overcorrecting excessively, which leads to low precision. To effectively harness the strengths of LLMs to address the recall challenges in sLMs, we propose Post-Correction via Overcorrection (PoCO), a novel approach that strategically balances recall and precision. PoCO first intentionally triggers overcorrection via an LLM to maximize recall by allowing comprehensive revisions, then applies a targeted post-correction step via fine-tuning smaller models to identify and refine erroneous outputs. We aim to harmonize both aspects by leveraging the generative power of LLMs while preserving the reliability of smaller supervised models. Our extensive experiments demonstrate that PoCO effectively balances GEC performance by increasing recall with competitive precision, ultimately improving the overall quality of grammatical error correction.
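The precision and recall trade-off between undercorrecting and overcorrecting systems can be shown on sets of edits. The sketch below is illustrative and not the PoCO evaluation code: it assumes edits are represented as hashable labels and scores them with F0.5, a precision-weighted F-measure commonly used in GEC evaluation. The abstract does not state which metric the authors report, and the edit labels are made up.

```python
def edit_scores(proposed, gold, beta=0.5):
    """Precision, recall, and F-beta of proposed edits against gold edits."""
    true_positives = len(proposed & gold)
    precision = true_positives / len(proposed) if proposed else 1.0
    recall = true_positives / len(gold) if gold else 1.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    f_beta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_beta


gold = {"e1", "e2", "e3", "e4"}
# Undercorrecting system: every edit is right, but half are missed.
conservative = edit_scores({"e1", "e2"}, gold)
# Overcorrecting system: all gold edits found, plus four unneeded edits.
aggressive = edit_scores({"e1", "e2", "e3", "e4", "x1", "x2", "x3", "x4"}, gold)
```

In this toy case the conservative system reaches precision 1.0 and recall 0.5, while the aggressive one reaches precision 0.5 and recall 1.0. A post-correction step that removes the four unneeded edits from the aggressive output would recover both.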


SimpleQA Verified: A Reliable Factuality Benchmark to Measure Parametric Knowledge

arXiv.org Artificial Intelligence

We introduce SimpleQA Verified, a 1,000-prompt benchmark for evaluating Large Language Model (LLM) short-form factuality based on OpenAI's SimpleQA. It addresses critical limitations in OpenAI's benchmark, including noisy and incorrect labels, topical biases, and question redundancy. SimpleQA Verified was created through a rigorous multi-stage filtering process involving de-duplication, topic balancing, and source reconciliation to produce a more reliable and challenging evaluation set, alongside improvements in the autorater prompt. On this new benchmark, Gemini 2.5 Pro achieves a state-of-the-art F1-score of 55.6, outperforming other frontier models, including GPT-5. This work provides the research community with a higher-fidelity tool to track genuine progress in parametric model factuality and to mitigate hallucinations. The benchmark dataset, evaluation code, and leaderboard are available at: https://www.kaggle.com/benchmarks/deepmind/simpleqa-verified.
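The abstract reports an F1-score without defining it. The sketch below assumes the definition used by the original SimpleQA benchmark, where F1 is the harmonic mean of overall accuracy and accuracy on attempted questions; whether SimpleQA Verified uses exactly this form is an assumption, and the counts in the example are illustrative.

```python
def simpleqa_f1(correct, incorrect, not_attempted):
    """Harmonic mean of overall-correct and correct-given-attempted rates."""
    total = correct + incorrect + not_attempted
    attempted = correct + incorrect
    if total == 0:
        raise ValueError("no prompts were graded")
    overall_correct = correct / total
    correct_given_attempted = correct / attempted if attempted else 0.0
    denominator = overall_correct + correct_given_attempted
    if denominator == 0.0:
        return 0.0
    return 2 * overall_correct * correct_given_attempted / denominator


# Illustrative counts only: 50 correct, 30 incorrect, 20 not attempted.
f1 = simpleqa_f1(correct=50, incorrect=30, not_attempted=20)
```

Because the second term ignores unattempted prompts, a model that declines when unsure keeps its attempted accuracy high, while the first term still penalizes it for leaving questions unanswered.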